Welcome back to deep learning. So today we want to talk about regularization techniques and we start
with a short introduction to regularization and the general problem of overfitting. So we will first talk about the background and ask what problem regularization is actually supposed to solve.
Then we want to talk about classical techniques, normalization, initialization, transfer learning
and multitask learning. So why are we talking about this topic so much? Well if you want to fit your
data, then problems like this one would be easy to fit as they have a clear solution. Typically, however, your data is noisy, so you cannot easily separate the classes. So what
you then run into is the problem of underfitting if you have a model that doesn't have a very high
capacity. Then you may have something like this line here which is not a very good fit to describe
the separation of the classes. The contrary is overfitting. So here we have models with very
high capacity which try to model everything that they observe in the training data. This may yield
decision boundaries that are not very reasonable. What we are actually interested in is a sensible
decision boundary that is somehow a compromise between the observed data and their actual
distribution.
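To make the difference tangible, here is a minimal sketch of this behavior; the two-moons toy data, scikit-learn, and the two particular classifiers are illustrative choices of mine and not taken from the lecture:

from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Noisy two-class toy data (purely illustrative)
X_train, y_train = make_moons(n_samples=200, noise=0.35, random_state=0)
X_test, y_test = make_moons(n_samples=2000, noise=0.35, random_state=1)

# Low capacity: a single linear decision boundary, prone to underfitting
linear = LogisticRegression().fit(X_train, y_train)

# Very high capacity: 1-nearest neighbor memorizes every training point,
# which yields perfect training accuracy but a jagged, unreasonable boundary
memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

for name, clf in [("linear", linear), ("1-NN", memorizer)]:
    print(name,
          "train accuracy:", round(clf.score(X_train, y_train), 3),
          "test accuracy:", round(clf.score(X_test, y_test), 3))

Typically the linear model is mediocre on both sets, while the 1-nearest-neighbor model is perfect on the training data and noticeably worse on unseen data, which is exactly the overfitting behavior described above.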
So we can analyze this problem with the so-called bias-variance decomposition. Here we stick to regression problems where we have an ideal function h of x that computes some value and that is typically affected by some measurement noise. So there is some additional value epsilon that is added to h of x, and it may be normally distributed with zero mean and standard deviation sigma. Now you can go ahead and use a model to estimate h. This estimate is denoted as f hat and it is fitted on some data set D. We can now express the loss for a single point as the expected value of the loss. Here this is simply the L2 loss, so we take the true function minus the estimated function to the power of two and compute the expected value.
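In formulas, this setup can be written as follows; this is a common notation and the slides may use a slightly different one:

y = h(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)

L(x) = \mathbb{E}\!\left[ \left( y - \hat{f}_D(x) \right)^2 \right]

Here \hat{f}_D denotes the model fitted on the training set D, and the expectation is taken over the measurement noise and over the draw of D.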
Interestingly, this loss can be shown to decompose into two parts plus an irreducible error. The first part is the bias, which is essentially the deviation of the expected value of our model from the true model. So this essentially measures how far we are off the ground truth. The other part can be explained by the limited size of the training data set. We can always try to find a model that is very flexible and tries to reduce the bias, but what we get as a result is an increase in variance. The variance is the expected value of the squared difference between y hat and its own expected value, so it is nothing else than the variance that we encounter in y hat. Then, of course, there is a small irreducible error. Now we can integrate this over every data point x and we get the loss for the entire training data set. By the way, a similar decomposition exists for classification with the 0-1 loss, which you can see in reference number 9. It is slightly different, but it has similar implications. So we learn that by allowing an increase in variance we can essentially reduce the bias, which means the prediction error of our model on the training data set.
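Written out, the decomposition just described is the standard one for the squared loss; again the notation may deviate slightly from the slides:

\mathbb{E}\!\left[ \left( y - \hat{f}_D(x) \right)^2 \right]
  = \underbrace{\left( \mathbb{E}\!\left[ \hat{f}_D(x) \right] - h(x) \right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[ \left( \hat{f}_D(x) - \mathbb{E}\!\left[ \hat{f}_D(x) \right] \right)^2 \right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}

where the expectations are taken over the draw of the training set D and the noise.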
Let's visualize this a bit. On the top left we see a low bias, low variance model. This is essentially always right
and it doesn't have a lot of noise in the predictions. In the top right we see a high bias model that is
very consistent which means it has a low variance and is consistently off. In the bottom left we see
a low bias high variance model. This has a considerable degree of variation but on average
it's very close to where it's supposed to be. In the bottom right we have the case that we want to avoid. This is a high bias, high variance model, which has lots of noise and is not even where it's supposed to be.
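You can make this trade-off visible numerically by redrawing the training set many times and looking at how the predictions of a low-capacity and a high-capacity model scatter around the truth. Here is a small sketch; the sine ground truth, the polynomial degrees, and the evaluation point are arbitrary illustrative choices and not taken from the lecture:

import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3    # standard deviation of the measurement noise
x0 = 0.25      # point at which we evaluate the fitted models

def h(x):
    # Ideal function h(x) that we pretend not to know
    return np.sin(2 * np.pi * x)

def fit_and_predict(degree):
    # Draw a fresh noisy training set and fit a polynomial of the given degree
    x = rng.uniform(0.0, 1.0, 20)
    y = h(x) + rng.normal(0.0, sigma, size=x.shape)
    coefficients = np.polyfit(x, y, degree)
    return np.polyval(coefficients, x0)

for degree in (1, 9):
    predictions = np.array([fit_and_predict(degree) for _ in range(2000)])
    bias_squared = (predictions.mean() - h(x0)) ** 2
    variance = predictions.var()
    print(f"degree {degree}: bias^2 = {bias_squared:.3f}, variance = {variance:.3f}")

Typically the low-degree polynomial shows a large squared bias and a small variance, while the high-degree polynomial shows the opposite, which mirrors the trade-off discussed above.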
So we can choose a type of model for a given data set, but simultaneously optimizing bias and variance is in general impossible. Bias and variance can be studied
together as model capacity which we'll take a look at on the next slide. The capacity of a model
describes the variety of functions it can approximate. This is related to the number of parameters, so people often say that if you increase the number of parameters, then you can get rid of your bias. This is true, but the capacity is by far not equal to the number of parameters. To be exact, you need to compute the Vapnik-Chervonenkis dimension. This VC dimension is an exact measure of capacity, and it is based on counting how many points can be separated by the model.
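To recall how this counting works, here is the classic textbook example, which is not from the slides: a set of points is shattered by a model class if the class can realize every possible assignment of the two labels to these points, and the VC dimension is the size of the largest set that can still be shattered. For linear classifiers in the plane this gives

\mathrm{VC}\left( \text{linear classifiers in } \mathbb{R}^2 \right) = 3,

because for three points in general position every one of the 2^3 = 8 labelings can be realized by some line, while no set of four points can be shattered. More generally, linear classifiers in \mathbb{R}^d have VC dimension d + 1, so here the capacity does grow with the number of parameters, but in general the two quantities are not the same thing.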
So the VC dimension of neural networks is extremely high compared to classical methods, and they have a very high model
capacity. They even manage to memorize random labels if you remember reference number 18. That's
again the paper that was looking into learning random labels on ImageNet. The VC dimension is
Deep Learning - Regularization Part 1
This video discusses the problem of over- and underfitting. In order to get a better understanding, we explore the bias-variance trade-off and look into the effects of training data size and the number of parameters.